Electronic device and method for controlling thereof

ABSTRACT

An electronic device and a method for controlling thereof are provided. The electronic device includes a memory storing a neural network model and a processor configured to input, to the neural network model, input data to obtain output data. Based on comparison between first output data based on an input first modality and second output data based on an input second modality, the neural network model is trained to output, in response to the second modality being input, the first modality corresponding to the first output data. The second modality may include at least one masking element.

CROSS-REFERENCE TO RELATED APPLICATION(S)

This application is a U.S. National Stage application under 35 U.S.C. §371 of an International application number PCT/KR2020/018985, filed on Dec. 23, 2020, which is based on and claims priority of a Korean patent application number 10-2020-0139595, filed on Oct. 26, 2020, in the Korean Intellectual Property Office, the disclosure of which is incorporated by reference herein in its entirety.

TECHNICAL FIELD

This disclosure relates to an electronic device and a method for controlling thereof. More particularly, the disclosure relates to an electronic device for obtaining output data through a neural network model and a controlling method thereof.

BACKGROUND ART

Recently, an electronic device performing an automatic speech recognition (ASR) function or a text to speech (TTS) function through a neural network model such as a deep neural network (DNN) has been developed.

The ASR function refers to a function of transforming an audio signal into text, and may be referred to as a speech to text (STT) function. The TTS function is a function of transforming text into an audio signal and outputting the audio signal.

To perform the ASR function, a related-art neural network model is trained to output an appropriate text for an input audio signal. To perform the TTS function, the related-art neural network model is trained to output an appropriate audio signal for an input text.

However, when the neural network model is trained by inputting only one of the text or the audio signal, the neural network model may output improper data due to the phonetic similarity or morphological similarity of a text.

For example, when a user utters “Tom” so that “Tom”, which is a name of a person, is output as a text, the related-art neural network model may output “tomb”, which is phonetically similar to “Tom”. Likewise, when a user inputs “Tom” as a text so that “Tom” is output as an audio signal, the related-art neural network model may amend the text and output an audio signal for “tomb”.

DISCLOSURE

Technical Problem

It is an object of the disclosure to provide an electronic device that may distinguish an audio signal having a phonetic similarity or a text having a morphological similarity by training a neural network model based on the text and audio signal as input data, and a method for controlling thereof.

Technical Solution

In accordance with an aspect of the disclosure, an electronic device is provided. The electronic device includes a memory storing a neural network model and a processor to input, to the neural network model, input data to obtain output data, and, based on comparison between first output data based on input first modality and second output data based on input second modality, in response to the second modality being input, the neural network model is trained to output the first modality corresponding to the first output data, and the second modality may include at least one masking element.

One of the first modality and the second modality may be a text, and the other may be an audio signal.

The neural network model may tokenize the text into a plurality of text elements, segment the audio signal into a plurality of audio elements, and mask at least one of the plurality of text elements or at least one of the plurality of audio elements.

The first modality may include a first text and the second modality comprises a first audio signal, and the neural network model is a model trained to output a second audio signal corresponding to the first text and a second text corresponding to the first audio signal with the first text composed of a plurality of tokenized text elements and the first audio signal in which at least one of segmented plurality of elements is masked as input data, and based on a first audio signal comprising the at least one masking element being input based on the comparison of the second audio signal and the second text, output a first text corresponding to the second audio signal.

The neural network model may perform learning, based on identification that the text corresponding to the second audio signal is not output with the output of the first audio signal including the at least one masking element, based on comparison between a plurality of audio elements included in the second audio signal and a plurality of text elements included in the second text.

The neural network model may output a text element corresponding to the masking element through the learning.

The first modality may include a first audio signal and the second modality may include a first text, and the neural network model may be a model trained to output a second text corresponding to the first audio signal and a second audio signal corresponding to the first text, with the first audio signal composed of a plurality of segmented audio elements and the first text in which at least one of the tokenized plurality of elements is masked as input data, and based on a first text signal comprising the at least one masking element being input based on the comparison of the second text and the second audio signal, output a first audio signal corresponding to the second text.

The neural network model may perform learning, based on identification that the audio signal corresponding to the second text is not output with the output of the first text including the at least one masking element, based on comparison between a plurality of text elements included in the second text and a plurality of audio elements included in the second audio signal.

The neural network model may output an audio element corresponding to the masking element through the learning.

In accordance with another aspect of the disclosure, a method of controlling an electronic device is provided. The method includes inputting input data to a neural network model and obtaining output data for the input data through computation of the neural network model, and, based on comparison between first output data based on input first modality and second output data based on input second modality, in response to the second modality being input, the neural network model is trained to output the first modality corresponding to the first output data, and the second modality may include at least one masking element.

One of the first modality and the second modality may be a text, and the other may be an audio signal.

The text may be tokenized into a plurality of text elements, the audio signal may be segmented into a plurality of audio elements, and at least one of the plurality of text elements or at least one of the plurality of audio elements may be masked and input to the neural network model.

The first modality may include a first text and the second modality may include a first audio signal, and the neural network model is a model trained to output a second audio signal corresponding to the first text and a second text corresponding to the first audio signal with the first text composed of a plurality of tokenized text elements and the first audio signal in which at least one of segmented plurality of elements is masked as input data, and based on a first audio signal comprising the at least one masking element being input based on the comparison of the second audio signal and the second text, output a first text corresponding to the second audio signal.

The neural network model may perform learning, based on identification that the text corresponding to the second audio signal is not output with the output of the first audio signal including the at least one masking element, based on comparison between a plurality of audio elements included in the second audio signal and a plurality of text elements included in the second text.

The neural network model may output a text element corresponding to the masking element through the learning.

The first modality may include a first audio signal and the second modality may include a first text, and the neural network model may be a model trained to output a second text corresponding to the first audio signal and a second audio signal corresponding to the first text, with the first audio signal composed of a plurality of segmented audio elements and the first text in which at least one of the tokenized plurality of elements is masked as input data, and based on a first text signal comprising the at least one masking element being input based on the comparison of the second text and the second audio signal, output a first audio signal corresponding to the second text.

The neural network model may perform learning, based on identification that the audio signal corresponding to the second text is not output with the output of the first text including the at least one masking element, based on comparison between a plurality of text elements included in the second text and a plurality of audio elements included in the second audio signal.

The neural network model may output an audio element corresponding to the masking element through the learning.

Effect of Invention

According to various embodiments as described above, an electronic device capable of distinguishing an audio signal having a phonetic similarity or a text having morphological similarity and a control method thereof are provided.

DESCRIPTION OF DRAWINGS

FIG. 1 is a flowchart illustrating an operation of an electronic device according to an embodiment of the disclosure;

FIG. 2 is a diagram illustrating architecture of a hardware/software module constituting an electronic device according to an embodiment of the disclosure;

FIG. 3 is a diagram illustrating an embodiment of masking at least one audio element according to an embodiment of the disclosure;

FIG. 4 is a diagram illustrating an embodiment of masking at least one text element according to an embodiment of the disclosure;

FIG. 5 is a diagram illustrating an operation of inputting a text and an audio signal not corresponding to each other according to an embodiment of the disclosure;

FIG. 6 is a flowchart illustrating an embodiment of providing an ASR function through the neural network model trained according to an embodiment of the disclosure;

FIG. 7 is a flowchart illustrating an embodiment of providing a TTS function through the trained neural network model according to an embodiment of the disclosure;

FIG. 8 is a block diagram illustrating an electronic device according to an embodiment of the disclosure;

FIG. 9 is a detailed block diagram illustrating an electronic device according to an embodiment of the disclosure; and

FIG. 10 is a diagram illustrating a method for controlling an electronic device according to an embodiment of the disclosure.

BEST MODE FOR CARRYING OUT THE INVENTION

The terms used in the present specification and the claims are general terms identified in consideration of the functions of embodiments of the disclosure. However, these terms may vary depending on intention, legal or technical interpretation, emergence of new technologies, and the like of those skilled in the related art. Some terms may be arbitrarily defined herein by an Applicant. The term may be interpreted as the meaning defined in this disclosure, and unless there is a specific definition of a term, the term may be construed based on the overall contents and technological common sense of those skilled in the related art.

In describing the disclosure, when it is decided that a detailed description for the known art related to the disclosure may unnecessarily obscure the gist of the disclosure, the detailed description of the known art may be shortened or omitted.

As used herein, terms such as “first,” and “second,” may identify corresponding components, regardless of importance or order, and are used to distinguish a component from another.

Also, the expression “configured to” used in the disclosure may be interchangeably used with other expressions such as “suitable for,” “having the capacity to,” “designed to,” “adapted to,” “made to,” and “capable of,” depending on cases.

A term such as “module,” “unit,” and “part,” is used to refer to an element that performs at least one function or operation and that may be implemented as hardware or software, or a combination of hardware and software.

The disclosure is further described in detail with reference to accompanying drawings and the contents described in the accompanying drawings, but the disclosure is not limited thereto.

The disclosure will be described in detail with reference to the attached drawings.

FIG. 1 is a flowchart illustrating an operation of an electronic device according to an embodiment of the disclosure.

An electronic device 100 according to an embodiment is a device that obtains output data for input data using a neural network model. The electronic device 100 may be, for example, a desktop personal computer (PC), a notebook, a smartphone, a tablet PC, a server, or the like. Alternatively, the electronic device 100 may be implemented as a system in which a cloud computing environment is established. The electronic device 100 is not limited to the above examples, and any device capable of computing using an artificial intelligence (AI) model may be the device of the disclosure.

The electronic device 100 may perform learning of a neural network model. The neural network model is an AI model including an artificial neural network and may be trained by deep learning. For example, the neural network model may include at least one of a deep neural network (DNN), a recurrent neural network (RNN), a convolutional neural network (CNN), or a generative adversarial network (GAN). The neural network model may be an automatic speech recognition (ASR) model, a text to speech (TTS) model, a natural language processing (NLP) model, or the like, but is not limited thereto.

The neural network model may be included in the electronic device 100 in an on-device form. This is merely exemplary, and the neural network model may be included in an external device (e.g., server) communicatively connected to the electronic device 100.

Referring to FIG. 1, the electronic device 100 may input a plurality of modalities into a neural network model for learning of a neural network model in operation S1110. Here, the plurality of modalities may be an audio signal and a text as an example. The audio signal and the text may be in a corresponding (or paired) relationship with each other. For example, the electronic device 100 may input, to a neural network model, a text “spoon” and an audio signal corresponding to the text “spoon” as input data of the neural network model. The electronic device 100 may store a speech transcript in which audio signals are matched for each text.

For learning, the neural network model may perform preprocessing on the inputted plurality of modalities. When an audio signal and a text are inputted, the neural network model may segment the input audio signal into a plurality of audio elements in operation S1210, and may tokenize the inputted text into a plurality of text elements in operation S1220. The segmentation of the audio signal may be, for example, phonetic segmentation, and the tokenization may be, for example, a tokenization of a grapheme unit, but the embodiment is not limited thereto.

For example, when a text “spoon” is inputted, a neural network model may obtain “s”, “p”, “oo”, and “n” by tokenizing the text “spoon” in a grapheme unit, and when an audio signal corresponding to the text “spoon” is inputted, the audio signal may be phonetically segmented and the neural network model may obtain an audio element corresponding to “s”, an audio element corresponding to “p”, an audio element corresponding to “oo”, and an audio element corresponding to “n”.
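As a non-limiting illustration, the grapheme-unit tokenization of the text “spoon” may be sketched as follows. The sketch assumes a greedy longest-match over a small inventory of multi-letter graphemes; the inventory and the function name are hypothetical and are not specified in the disclosure.

```python
# Hypothetical inventory of multi-letter graphemes (an assumption for
# illustration; the disclosure does not define one).
MULTI_LETTER_GRAPHEMES = ("oo", "sh", "ch", "th")


def tokenize_graphemes(text, multi=MULTI_LETTER_GRAPHEMES):
    """Split text into grapheme-like units using greedy longest-match."""
    ordered = sorted(multi, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for grapheme in ordered:
            if text.startswith(grapheme, i):
                tokens.append(grapheme)
                i += len(grapheme)
                break
        else:
            # No multi-letter grapheme starts here; emit a single letter.
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize_graphemes("spoon"))  # ['s', 'p', 'oo', 'n']
```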

The neural network model may mask at least one of a plurality of audio elements or at least one of a plurality of text elements. The neural network model may replace at least one of the plurality of audio elements with a mask element, or replace at least one of the plurality of text elements with a mask element.

For example, when the text “spoon” and the audio signal corresponding to the text “spoon” are inputted as described above, the neural network model may replace at least one of “s”, “p”, “oo”, and “n” obtained through the tokenization of the text with a mask element. Alternatively, the neural network model may replace at least one of an audio element corresponding to “s” obtained through segmentation of an audio signal, an audio element corresponding to “p”, an audio element corresponding to “oo”, and an audio element corresponding to “n” with a mask element.
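The replacement of elements with mask elements may be sketched, under stated assumptions, as follows. The mask placeholder `MASK`, the function name, and the random choice of positions are hypothetical; the disclosure does not specify how the masked positions are selected. The same routine may be applied to a list of text elements or to a list of audio elements.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder standing in for a mask element


def mask_elements(elements, num_masks=1, rng=None):
    """Return a copy of elements with num_masks positions replaced by MASK.

    Assumption: positions are chosen uniformly at random.
    """
    if not 1 <= num_masks <= len(elements):
        raise ValueError("num_masks must be between 1 and len(elements)")
    rng = rng or random.Random()
    positions = sorted(rng.sample(range(len(elements)), num_masks))
    masked = list(elements)
    for position in positions:
        masked[position] = MASK
    return masked, positions


text_elements = ["s", "p", "oo", "n"]
masked, positions = mask_elements(text_elements, num_masks=2)
print(masked, positions)
```

The original list is left unchanged, so the unmasked elements remain available as the reference against which the output for the masked positions can later be compared.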

The neural network model may input a text composed of a plurality of tokenized text elements, and an audio signal in which at least one of the segmented plurality of audio elements is masked to an input layer of a neural network model (e.g., a multi-modal model or a cross-modal model) in operation S1300. Alternatively, the neural network model may input an audio signal composed of a plurality of segmented audio elements and a text in which at least one of a plurality of tokenized text elements is masked to an input layer of the neural network model in operation S1300.

The neural network model may perform learning with text composed of a plurality of tokenized text elements and an audio signal in which at least one of the segmented plurality of audio elements is masked as input data. Alternatively, the neural network model may perform learning with an audio signal composed of a plurality of segmented audio elements and a text in which at least one of a plurality of tokenized text elements is masked as input data.

First, a case will be described in which a text composed of a plurality of tokenized text elements (hereinafter, referred to as a first text) and an audio signal in which at least one of the segmented plurality of audio elements is masked (hereinafter, referred to as a first audio signal) are input.

In this case, the neural network model may output a second audio signal corresponding to the first text through a neural network operation in operation S1410 and output a second text corresponding to the first audio signal in operation S1420. Specifically, the neural network model may output a plurality of audio elements corresponding to a plurality of text elements through a neural network computation with a plurality of text elements as input data, and may output a plurality of text elements corresponding to the mask elements and a plurality of text elements corresponding to the plurality of unmasked audio elements through neural network computation, with at least one mask element and a plurality of unmasked audio elements (according to an embodiment, the unmasked audio element may be singular) as input data.

The neural network model may compare the output second audio signal (including a plurality of audio elements) and a second text (including a plurality of text elements) and may identify whether a second text corresponding to the second audio signal is output in operation S1500. Specifically, the neural network model may compare a plurality of audio elements constituting the outputted second audio signal and a plurality of text elements constituting the outputted second text. For example, if first to fourth audio elements are included in the second audio signal and fifth to eighth text elements are included in the second text, the neural network model may identify whether the fifth text element corresponds to the first audio element, identify whether the sixth text element corresponds to the second audio element, identify whether the seventh text element corresponds to the third audio element, and identify whether the eighth text element corresponds to the fourth audio element.

When it is identified that the plurality of text elements constituting the second text do not correspond to the plurality of audio elements constituting the second audio signal, the neural network model may perform learning in operation S1600-N. The neural network model may learn to output a first text corresponding to a second audio signal when the first audio signal including the at least one masking element is input as the input data. The learning may be performed as correcting at least one weight of the plurality of layers constituting the neural network model to output the first text with the first audio signal as input data, and the computation of the weight may be performed by the processor of the electronic device 100.

In the embodiment described above, if the fifth text element corresponds to the first audio element, the sixth text element does not correspond to the second audio element, the seventh text element does not correspond to the third audio element, and the eighth text element corresponds to the fourth audio element, the neural network model may perform learning to correct the weight of the plurality of layers that constitute the neural network model such that the sixth text element is output as the output for the second audio element and the seventh text element is output as the output for the third audio element. Here, the second audio element and the third audio element may be masked elements.
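The element-by-element comparison and the identification of the positions requiring correction may be sketched as follows. As an assumption made only for illustration, each output audio element is represented by the label of the grapheme it corresponds to, so that correspondence reduces to equality of labels; the disclosure does not specify how correspondence is tested.

```python
def mismatched_positions(audio_labels, text_elements):
    """Return the positions at which a text element does not correspond
    to the audio element at the same position.

    Assumption: an audio element is represented by its grapheme label,
    so correspondence is modeled as label equality.
    """
    if len(audio_labels) != len(text_elements):
        raise ValueError("the two sequences must have the same length")
    return [
        position
        for position, (audio, text) in enumerate(zip(audio_labels, text_elements))
        if audio != text
    ]


# First to fourth audio elements (second audio signal) and fifth to
# eighth text elements (second text); the second and third pairs differ.
second_audio = ["s", "p", "oo", "n"]
second_text = ["s", "b", "u", "n"]
print(mismatched_positions(second_audio, second_text))  # [1, 2]
```

An empty result corresponds to the case in which the second text corresponds to the second audio signal, and a non-empty result identifies the positions for which the weights would be corrected.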

If it is identified that a plurality of text elements constituting the outputted second text correspond to a plurality of audio elements constituting the second audio signal after learning, the neural network model may terminate learning in operation S1600-Y. Even before learning, if the plurality of text elements constituting the outputted second text correspond to a plurality of audio elements constituting the second audio signal, the neural network model may terminate a learning procedure without performing learning.
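The control flow of operations S1500 and S1600 may be sketched as the loop below. The callables `forward`, `corresponds`, and `update`, the step budget `max_steps`, and the toy state standing in for the weights are all hypothetical; an actual embodiment would correct weights of the layers rather than edit a list.

```python
def train_until_consistent(forward, corresponds, update, max_steps=100):
    """Repeat output, comparison, and correction until the outputs
    correspond (S1600-Y). Returns the number of corrections performed.

    max_steps is an assumed safety budget, not part of the disclosure.
    """
    for step in range(max_steps):
        second_audio, second_text = forward()
        if corresponds(second_audio, second_text):
            return step  # corrections performed so far (0 if none needed)
        update(second_audio, second_text)  # S1600-N: perform learning
    raise RuntimeError("outputs still inconsistent after max_steps corrections")


# Toy stand-in for the model: its "weights" are the text it outputs.
state = {"text": ["s", "b", "u", "n"]}
audio_labels = ["s", "p", "oo", "n"]


def forward():
    return audio_labels, list(state["text"])


def corresponds(audio, text):
    return audio == text


def update(audio, text):
    # Correct the first non-corresponding position only.
    for position, (a, t) in enumerate(zip(audio, text)):
        if a != t:
            state["text"][position] = a
            break


updates = train_until_consistent(forward, corresponds, update)
print(updates, state["text"])  # 2 ['s', 'p', 'oo', 'n']
```

Calling the function again on the already consistent state returns 0, which mirrors the case in which the learning procedure terminates without performing learning.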

If an audio signal composed of a plurality of segmented audio elements (hereinafter, first audio signal) and a text in which at least one of the plurality of tokenized text elements is masked (hereinafter, first text) are input, the technical idea described above may be applied in a similar manner.

The neural network model may output a second text corresponding to the first audio signal through a neural network operation in operation S1410 and output a second audio signal corresponding to the first text in operation S1420. Specifically, the neural network model may output a plurality of text elements corresponding to the plurality of audio elements with the plurality of audio elements as input data, and may output audio elements corresponding to the mask elements and a plurality of audio elements corresponding to the unmasked plurality of text elements with the at least one mask element and a plurality of unmasked text elements (according to an embodiment, the unmasked text element may be singular) as input data.

The neural network model may compare the output second text (which includes a plurality of text elements) and the second audio signal (including a plurality of audio elements) to identify whether the second audio signal corresponding to the second text is output in operation S1500. The neural network model may compare a plurality of text elements constituting the outputted second text and a plurality of audio elements constituting the outputted second audio signal. For example, when the first to fourth audio elements are included in the second audio signal and the fifth to eighth text elements are included in the second text, the neural network model may identify whether the first audio element corresponds to the fifth text element, identify whether the second audio element corresponds to the sixth text element, identify whether the third audio element corresponds to the seventh text element, and identify whether the fourth audio element corresponds to the eighth text element.

If it is identified that the plurality of audio elements constituting the second audio signal do not correspond to a plurality of text elements constituting the second text, the neural network model may perform learning of the neural network model in operation S1600-N. Specifically, the neural network model may learn to output a first audio signal corresponding to a second text when a first text including the at least one masking element is input. Here, the learning may be a task of correcting at least one weight of a plurality of layers constituting a neural network model to output a first audio signal with the first text as input data.

In the embodiment described above, if the first audio element corresponds to the fifth text element, the second audio element does not correspond to the sixth text element, the third audio element does not correspond to the seventh text element, and the fourth audio element corresponds to the eighth text element, the neural network model may perform learning to correct the weight of the plurality of layers that constitute the neural network model such that the second audio element is output as the output for the sixth text element and the third audio element is output as the output for the seventh text element. The sixth text element and the seventh text element may be masked elements.

When it is identified that the plurality of audio elements constituting the outputted second audio signal correspond to a plurality of text elements constituting the second text after learning, the neural network model may terminate learning in operation S1600-Y. Even prior to learning, if the plurality of audio elements constituting the outputted second audio signal correspond to a plurality of text elements constituting the second text, the neural network model may terminate the learning procedure without performing learning.

As such, by training the neural network model with both the text and the audio signal as input data (which may be called cross-modality learning), the disclosure may distinguish between audio signals or texts having similarities. For example, in order to learn “Tom”, which is a name of a person, the neural network model may be trained by masking at least one of the text elements “T”, “o”, and “m” constituting the text “Tom”, or at least one of the audio element corresponding to “T”, the audio element corresponding to “o”, and the audio element corresponding to “m”. The neural network model trained in this manner may output the audio signal “Tom” as the text “Tom” and may prevent the error of outputting “tomb”, which has a phonetic similarity. In addition, the neural network model may output the text “Tom” as the audio signal “Tom”, and may prevent outputting the audio signal “tomb” as a result of correcting the text “Tom” to “tomb”.

Referring to FIG. 2, the neural network model will be described in greater detail.

FIG. 2 is a diagram illustrating architecture of a hardware/software module constituting an electronic device according to an embodiment of the disclosure.

Referring to FIG. 2, the electronic device 100 includes a memory 20 and may transmit the audio data stored in the memory 20 to an audio encoder 30. The electronic device 100 may transmit text data stored in the memory 20 to a text encoder 40. Here, the transmitted audio data and text data may be learning data for learning of a neural network model. The audio data transmitted to the audio encoder 30 may have a relationship (or pairing relationship) corresponding to the text data transmitted to the text encoder 40.

The electronic device 100 of the disclosure may include a microphone 10 and may transmit the user voice received through the microphone 10 to the audio encoder 30. The user voice received through the microphone 10 may be input to the neural network model together with the text in the learning step, and may be input to the neural network model in an inference step after learning of the neural network model.

The audio encoder 30 may perform preprocessing of the audio signal. Specifically, the audio encoder 30 may remove noise of an audio signal (which may be the user voice or audio data described above), segment the audio signal into a plurality of audio elements, and perform a feature transformation for the plurality of audio elements. Here, noise removal may refer to transforming the audio signal into a frequency domain and then extracting an area corresponding to a voice frequency. The embodiment is not necessarily limited thereto, and various tools that may remove noise included in the audio signal, such as noise canceling, or the like, may be used for the noise removal. The segmentation of the audio signal may be a phonetic segmentation operation that segments the audio signal into a plurality of audio elements corresponding to a plurality of text elements. The feature transformation is to transform each audio element into a vector, and the electronic device 100 may store a plurality of vectors corresponding to the plurality of audio elements.
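The noise removal described above, namely transforming the audio signal into a frequency domain and extracting an area corresponding to a voice frequency, may be sketched as follows. The sketch uses a naive discrete Fourier transform from the standard library for self-containment; the function names, the sample rate, and the band limits are assumptions chosen for a toy signal and are not values taken from the disclosure.

```python
import cmath
import math


def dft(samples):
    """Naive discrete Fourier transform (O(n^2)); adequate for a sketch."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse transform, returning the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def keep_band(samples, sample_rate, low_hz, high_hz):
    """Zero every frequency bin outside [low_hz, high_hz] and invert."""
    n = len(samples)
    spectrum = dft(samples)
    for k in range(n):
        # Bins above n/2 hold the negative-frequency mirror images.
        frequency = (k if k <= n // 2 else n - k) * sample_rate / n
        if not low_hz <= frequency <= high_hz:
            spectrum[k] = 0
    return idft(spectrum)


# Toy signal: a 4 Hz "voice" component plus a 20 Hz "noise" component.
rate = 64
signal = [
    math.sin(2 * math.pi * 4 * i / rate) + 0.5 * math.sin(2 * math.pi * 20 * i / rate)
    for i in range(rate)
]
filtered = keep_band(signal, rate, low_hz=2, high_hz=8)
```

In this toy case the filtered signal retains only the 4 Hz component. A practical implementation would typically use a fast Fourier transform and a band suited to human speech.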

The text encoder 40 may perform preprocessing of the text. The text encoder 40 may perform normalization of the text, tokenize the text into a plurality of text elements, and perform a feature transformation for the plurality of text elements. The normalization of the text may be a task of changing a capital letter included in the text to a small letter and removing an unnecessary element included in the text (e.g., a special character that is not a natural language and has no particular meaning, etc.). The tokenization is segmenting a text into a plurality of text elements by a predetermined unit. Here, the unit may be a grapheme unit, but is not limited thereto. The feature transformation is to transform each text element into a vector, and the electronic device 100 may store a plurality of vectors corresponding to a plurality of text elements.
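The normalization of the text may be sketched as follows. Which characters count as unnecessary elements is an assumption made for illustration (here, everything other than ASCII letters, digits, apostrophes, and whitespace); the disclosure does not enumerate them, and texts in other languages would need a different rule.

```python
import re


def normalize_text(text):
    """Lowercase the text and drop characters treated as unnecessary.

    Assumption: only ASCII letters, digits, apostrophes, and whitespace
    are kept; runs of whitespace are collapsed to a single space.
    """
    lowered = text.lower()
    cleaned = re.sub(r"[^a-z0-9'\s]", "", lowered)
    return re.sub(r"\s+", " ", cleaned).strip()


print(normalize_text("  Spoon!! #1 "))  # spoon 1
```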

The audio encoder 30 and the text encoder 40 may be, for example, a part of a neural network model as a software module. However, according to an embodiment, the audio encoder 30 and the text encoder 40 may be implemented as a hardware module, or may be stored in the memory 20 as a software module separate from the neural network model.

The neural network model 50 (which may be referred to as, e.g., a cross-modal model) may mask at least one of a plurality of audio elements (specifically, a plurality of vectors corresponding to the plurality of audio elements) generated by the audio encoder 30 or at least one of a plurality of text elements (specifically, a plurality of vectors corresponding to the plurality of text elements) generated by the text encoder 40.

The neural network model 50 may output a plurality of audio elements corresponding to the plurality of text elements and a plurality of text elements corresponding to the plurality of audio elements in which at least one element is masked, with the plurality of text elements and the plurality of audio elements in which at least one element is masked as input data. Here, the plurality of output audio elements and the plurality of text elements may be represented by a vector.

The audio decoder 60 may transform a plurality of audio vectors output by the neural network model 50 into a plurality of audio elements (which may be a wave signal or an analog signal) and the text decoder 70 may transform the plurality of text vectors output by the neural network model 50 into a plurality of text elements.

A discrimination module 80 may compare the plurality of audio elements generated by the audio decoder 60 and the plurality of text elements generated by the text decoder 70, and may identify whether the plurality of audio elements and the plurality of text elements have a corresponding relationship (or pairing relationship). The discrimination module 80 may provide information on the identification result to the neural network model 50, and the neural network model 50 may perform learning to adjust the values of the plurality of weights constituting the neural network model 50 based on the information received from the discrimination module 80.

The audio decoder 60 and the text decoder 70 may be part of a neuralnetwork model, as an example of a software module. However, according toan embodiment, the audio decoder 60 and the text decoder 70 may beimplemented as hardware modules and may be stored in the memory 20 as asoftware module separate from the neural network model.

The discrimination module 80 is a software module and may be a part of aneural network model, and may be implemented as a hardware moduleaccording to an embodiment, and may be stored in the memory 20 as asoftware module separate from the neural network model.

FIG. 3 is a diagram illustrating an embodiment of masking at least one audio element according to an embodiment of the disclosure.

The neural network model may receive a text and an audio signal in a learning stage. For example, the neural network model may receive a text “spoon” and an audio signal corresponding to the text “spoon”.

The neural network model may tokenize the text “spoon” into a plurality of text elements via a text encoder. As an example, the neural network model may obtain “s”, “p”, “oo”, and “n” by tokenizing the text “spoon” in grapheme units.

The neural network model may segment an audio signal corresponding to the text “spoon” into a plurality of audio elements through an audio encoder. For example, the neural network model may segment an audio signal corresponding to “spoon” into an audio element corresponding to “s”, an audio element corresponding to “p”, an audio element corresponding to “oo”, and an audio element corresponding to “n”.

The neural network model may mask at least one of the plurality of audio elements. For example, referring to FIG. 3, the neural network model may replace an audio signal corresponding to “p” with a first mask element, and replace an audio signal corresponding to “oo” with a second mask element.
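
As a minimal sketch of the masking step described above, and not the disclosed implementation, the following Python function replaces selected elements of a sequence with numbered mask elements. The function name `mask_elements`, the placeholder strings such as `<mask_1>`, and the string labels standing in for audio elements are assumptions made for illustration; in the disclosure the elements are vectors generated by the audio encoder.

```python
def mask_elements(elements, positions):
    """Return a copy of `elements` with the given positions replaced by mask elements.

    Mask elements are numbered in positional order (<mask_1>, <mask_2>, ...).
    The placeholder format is a hypothetical choice for this sketch.
    """
    masked = list(elements)  # copy so the original sequence is left untouched
    for k, index in enumerate(sorted(positions), start=1):
        masked[index] = f"<mask_{k}>"
    return masked


# String labels stand in for the audio elements of "spoon".
audio_elements = ["audio(s)", "audio(p)", "audio(oo)", "audio(n)"]
masked_audio = mask_elements(audio_elements, [1, 2])
print(masked_audio)
```

The same helper could be applied to text elements, which corresponds to the text-masking embodiment.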

The neural network model may input a plurality of text elements, at least one mask element, and at least one audio element to an input layer to obtain output data. Specifically, the neural network model may output a plurality of audio elements corresponding to a plurality of text elements by performing a computation of the neural network model with the plurality of text elements as input. As an example, the neural network model may output an audio element corresponding to “s”, an audio element corresponding to “p”, an audio element corresponding to “oo”, and an audio element corresponding to “n” with the input of “s”, “p”, “oo”, and “n”. The neural network model may output a plurality of text elements corresponding to the at least one mask element and the at least one audio element by performing a computation of the neural network model with the at least one mask element and the at least one audio element as input. As an example, the neural network model may output texts such as “s”, “p”, “o”, and “n” with the audio element corresponding to “s”, a first mask element, a second mask element, and an audio element corresponding to “n” as input.

The neural network model may compare the plurality of output audio elements and the plurality of output text elements. The neural network model may identify, through a discriminator (which may be referred to as a discriminator layer or a discrimination module), whether the plurality of output audio elements and the plurality of output text elements are in a corresponding relationship to each other.

The neural network model may identify, among the plurality of output text elements, at least one text element that does not correspond to the plurality of output audio elements. In the above-described embodiment, the neural network model may identify that the text “o” obtained as an output for the second mask element does not correspond to the audio element obtained as the output for the text “oo”. In this case, the neural network model may be trained to output a plurality of text elements corresponding to the plurality of output audio elements, that is, texts such as “s”, “p”, “oo”, and “n”, with the audio element corresponding to “s”, the first mask element, the second mask element, and the audio element corresponding to “n” as input.
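
The comparison in this example can be sketched as follows. This is only an illustration under stated assumptions: the discriminator of the disclosure may be a learned module, whereas the hypothetical function `mismatched_positions` below simply compares element labels position by position, and each output audio element is represented by the label of the text element it was generated from.

```python
def mismatched_positions(text_from_masked_audio, labels_of_output_audio):
    """Return the positions where the two output sequences do not correspond.

    Both arguments are sequences of grapheme labels of equal length.
    A direct label comparison stands in for the discriminator here.
    """
    if len(text_from_masked_audio) != len(labels_of_output_audio):
        raise ValueError("both sequences must have the same number of elements")
    return [
        i
        for i, (text, audio_label) in enumerate(
            zip(text_from_masked_audio, labels_of_output_audio)
        )
        if text != audio_label
    ]


# Text output for the masked audio input versus the labels of the audio
# elements output for the text input "s", "p", "oo", "n".
second_text = ["s", "p", "o", "n"]
second_audio_labels = ["s", "p", "oo", "n"]
errors = mismatched_positions(second_text, second_audio_labels)
print(errors)
```

In this toy run the only mismatch is at the position of the second mask element, which is the position the training described above would correct.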

Through such learning, the neural network model may output the appropriate text for an input audio signal that includes the masking element, which may prevent an error of outputting a text that differs from the user's intention due to phonetic similarity.

FIG. 4 is a diagram illustrating an embodiment of masking at least one text element according to an embodiment of the disclosure.

The neural network model may receive a text and an audio signal in a learning stage. For example, the neural network model may receive the text “spoon” and the audio signal corresponding to the text “spoon”.

The neural network model may tokenize the text “spoon” into a plurality of text elements through the text encoder. For example, the neural network model may obtain “s”, “p”, “oo”, and “n” by tokenizing the text “spoon” in grapheme units.

The neural network model may segment an audio signal corresponding to the text “spoon” into a plurality of audio elements through an audio encoder. For example, the neural network model may segment an audio signal corresponding to “spoon” into an audio element corresponding to “s”, an audio element corresponding to “p”, an audio element corresponding to “oo”, and an audio element corresponding to “n”.

The neural network model may mask at least one of the plurality of text elements. For example, referring to FIG. 4, the neural network model may replace the text “p” with the first mask element and replace the text “oo” with the second mask element.

The neural network model may input at least one mask element, at least one text element, and a plurality of audio elements to the input layer to obtain output data. Specifically, the neural network model may output an audio element corresponding to each element by performing a computation of the neural network model with the at least one mask element and the at least one text element as input. For example, with the input of “s”, the first mask element, the second mask element, and “n”, the neural network model may output an audio element corresponding to “s”, an audio element corresponding to the first mask element, an audio element corresponding to the second mask element, and an audio element corresponding to “n”.

The neural network model may output a plurality of text elements corresponding to the plurality of audio elements by performing a computation of the neural network model with the plurality of audio elements as input. As an example, with an audio element corresponding to “s”, an audio element corresponding to “p”, an audio element corresponding to “oo”, and an audio element corresponding to “n” as input, the neural network model may output text such as “s”, “p”, “oo”, and “n”.

The neural network model may compare the plurality of output audio elements and the plurality of output text elements. Specifically, the neural network model may identify, through a discriminator, whether the plurality of output audio elements and the plurality of output text elements correspond to each other.

The neural network model may identify, among the plurality of output audio elements, at least one audio element that does not correspond to the plurality of output text elements. In the above-described embodiment, the neural network model may identify that the audio element corresponding to “o” obtained as the output for the second mask element does not correspond to the text element obtained as the output for the audio element corresponding to “oo”. In this example, the neural network model may be trained to output audio elements corresponding to the plurality of output text elements, that is, an audio element corresponding to “s”, an audio element corresponding to “p”, an audio element corresponding to “oo”, and an audio element corresponding to “n”, with the text “s”, the first mask element, the second mask element, and the text “n” as input.

Through learning as above, the neural network model may output an appropriate audio signal for an input text that includes the masking element, which may prevent an error of outputting an audio signal that differs from the user's intention due to morphological similarity.

FIG. 5 is a diagram illustrating an operation of inputting a text and an audio signal not corresponding to each other according to an embodiment of the disclosure.

In the embodiments above, the operation of the neural network model is described based on an example in which a text and an audio signal in a corresponding relationship are input.

However, a text and an audio signal that do not correspond to each other may be input to the neural network model. As an example, referring to FIG. 5, the text input for learning is “bloon” and the audio signal input for learning is an audio signal corresponding to “spoon”.

In this example, the neural network model may terminate the learning procedure without performing the cross-modality learning described above. When a text and an audio signal are input, the neural network model may identify whether the text corresponds to the audio signal, or whether the audio signal corresponds to the text, prior to pre-processing for training (e.g., text tokenization or audio signal segmentation).

If it is identified that the text and the audio signal are not in a corresponding relationship, the neural network model may perform neither the pre-processing for learning nor the learning itself, and may terminate the learning procedure.

The neural network model may perform learning for the input text and the audio signal if it is identified that the text and the audio signal are in a corresponding relationship.

Accordingly, the neural network model may prevent errors of output data that may occur due to learning of a text and an audio signal that are not in a corresponding relationship, and may prevent unnecessary computations of the processor.
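
The gating behavior described for FIG. 5 can be sketched as a check that runs before any pre-processing or learning. The disclosure does not specify how the correspondence is identified, so the callback `is_paired`, the callback `train_step`, and the toy representation of an audio signal as a (label, samples) tuple are all hypothetical and serve only to make the control flow concrete.

```python
def train_if_paired(text, audio, is_paired, train_step):
    """Run `train_step` only when `is_paired` reports that text and audio correspond.

    Returns True when training was performed and False when the procedure
    was terminated before pre-processing.
    """
    if not is_paired(text, audio):
        return False  # terminate: no tokenization, segmentation, or learning
    train_step(text, audio)
    return True


# Toy stand-ins: an audio signal is a (label, samples) tuple, and the pairing
# check compares the label with the text.
trained_texts = []


def toy_is_paired(text, audio):
    return audio[0] == text


def toy_train_step(text, audio):
    trained_texts.append(text)


spoon_audio = ("spoon", [0.1, 0.2, 0.3])
skipped = train_if_paired("bloon", spoon_audio, toy_is_paired, toy_train_step)
trained = train_if_paired("spoon", spoon_audio, toy_is_paired, toy_train_step)
print(skipped, trained, trained_texts)
```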

FIG. 6 is a flowchart illustrating an embodiment of providing an ASR function through the neural network model trained according to an embodiment of the disclosure.

The electronic device 100 may receive an audio signal corresponding to the user voice through the microphone 10 in operation S610. The audio signal corresponding to the user voice may be an analog signal (or wave signal).

The electronic device 100 may input the audio signal corresponding to the user voice into the neural network model. In this case, the neural network model may perform preprocessing for audio signal processing. Specifically, the neural network model may remove noise included in the audio signal in operation S620. Here, the removal of noise may be, for example, an operation of transforming the audio signal into a frequency domain and extracting an area corresponding to a voice frequency.
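
One possible reading of the frequency-domain noise removal of operation S620 is a band-pass filter: transform the signal, keep only the bins inside a voice band, and transform back. The sketch below is an assumption-laden illustration rather than the disclosed method. The function name `band_pass`, the toy sample rate, and the band limits are hypothetical, and a naive discrete Fourier transform is used only to keep the example self-contained (a practical implementation would use a fast Fourier transform).

```python
import cmath
import math


def band_pass(samples, sample_rate, low_hz, high_hz):
    """Keep only frequency components between low_hz and high_hz (inclusive)."""
    n = len(samples)
    # Forward DFT: time domain -> frequency domain.
    spectrum = [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Zero every bin outside the band. Bin k and bin n - k carry the same
    # physical frequency for a real-valued signal.
    for k in range(n):
        freq = min(k, n - k) * sample_rate / n
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0
    # Inverse DFT: frequency domain -> time domain.
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


# Toy signal: a 5 Hz "voice" component plus a 20 Hz "noise" component.
n, rate = 64, 64
voice = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
noise = [0.5 * math.sin(2 * math.pi * 20 * t / n) for t in range(n)]
mixed = [v + x for v, x in zip(voice, noise)]
filtered = band_pass(mixed, rate, low_hz=1, high_hz=10)
max_error = max(abs(f - v) for f, v in zip(filtered, voice))
```

In this toy case the filtered signal matches the 5 Hz component up to floating-point error, because both components fall exactly on DFT bins.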

The neural network model may segment the audio signal into a plurality of audio elements in operation S630. As an example, the neural network model may perform phonetic segmentation of the audio signal.

The neural network model may perform a feature transformation of the plurality of audio elements in operation S640. Here, the feature transformation is an operation of transforming each audio element into a vector, so that the electronic device 100 may store a plurality of vectors corresponding to the plurality of audio elements.

The neural network model may input the plurality of vectors to an input layer of the neural network model to perform a computation of the neural network model in operation S650, and generate a text corresponding to the plurality of vectors in operation S660. The neural network model may output text corresponding to the input audio signal based on the weight values included in the plurality of layers and the computation of the plurality of vectors input to the input layer.

FIG. 7 is a flowchart illustrating an embodiment of providing a TTS function through the trained neural network model according to an embodiment of the disclosure.

The electronic device 100 may receive text through an inputter (not shown) in operation S710. The inputter (not shown) may be a keyboard, for example, but is not limited thereto, and may be implemented with various devices capable of receiving user input, such as a touch screen, a touch pad, a soft keyboard, and the like.

The electronic device 100 may input the text into the neural network model. In this case, the neural network model may perform preprocessing for text processing. Specifically, the neural network model may perform normalization of the text in operation S720. The normalization of the text may include changing an uppercase character included in the text to a lowercase character, removing an unnecessary element included in the text (e.g., a special character that is not a natural language and has no particular meaning), and the like.

The neural network model may tokenize the text into a plurality of text elements in operation S730. Here, the tokenization refers to segmenting a text into a plurality of text elements by a predetermined unit, wherein the unit may be a grapheme unit, but is not limited thereto.
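
Operations S720 and S730 can be illustrated with a short sketch. It assumes ASCII English input and a small, hypothetical inventory of multi-letter graphemes (`MULTI_LETTER_GRAPHEMES`); neither the inventory nor the function names `normalize` and `tokenize` come from the disclosure, which leaves the tokenization unit open.

```python
import re

# Hypothetical inventory of multi-letter graphemes, tried before single letters.
MULTI_LETTER_GRAPHEMES = ("oo", "sh", "ch", "th")


def normalize(text):
    """Lowercase the text and drop characters other than a-z, 0-9, and spaces."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower())


def tokenize(text, multi=MULTI_LETTER_GRAPHEMES):
    """Split text into grapheme-like units, preferring multi-letter graphemes."""
    tokens, i = [], 0
    while i < len(text):
        for grapheme in multi:
            if text.startswith(grapheme, i):
                tokens.append(grapheme)
                i += len(grapheme)
                break
        else:
            # No multi-letter grapheme starts here, so emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens


text_elements = tokenize(normalize("Spoon!"))
print(text_elements)
```

With this inventory, the input "Spoon!" is normalized to "spoon" and split into the four text elements used in the examples of FIG. 3 and FIG. 4.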

The neural network model may perform a feature transformation of the plurality of text elements in operation S740. The feature transformation is an operation of transforming each text element into a vector, so that the electronic device 100 may store a plurality of vectors corresponding to the plurality of text elements.

The neural network model may input the plurality of vectors to an input layer of the neural network model to perform a computation of the neural network model in operation S750, and generate an audio signal corresponding to the plurality of vectors in operation S760. Specifically, the neural network model may output an audio signal corresponding to the input text based on the weight values included in the plurality of layers and the computation of the plurality of vectors input to the input layer.

FIG. 8 is a block diagram illustrating an electronic device according to an embodiment of the disclosure.

Referring to FIG. 8, the electronic device 100 according to an embodiment includes the memory 110 and the processor 120.

At least one instruction may be stored in the memory 110. An operating system (O/S) for driving the electronic device 100 may be stored in the memory 110. A software program or application for executing various embodiments of the disclosure may be stored in the memory 110. The memory 110 may include a semiconductor memory such as a flash memory or a magnetic storage medium such as a hard disk.

A software module for executing various embodiments of the disclosure may be stored in the memory 110, and the processor 120 may execute the software module stored in the memory 110 to control the operation of the electronic device 100. The memory 110 may be accessed by the processor 120, and reading, writing, modifying, updating, or the like, of data by the processor 120 may be performed.

In the disclosure, the term memory 110 may be used to include read-only memory (ROM, not shown) in the processor 120, random access memory (RAM, not shown), or a memory card (not shown) (for example, a micro secure digital (SD) card or a memory stick) mounted to the electronic device 100.

In particular, the memory 110 may store a neural network model and software modules such as a text encoder for transforming text into a vector, an audio encoder for transforming an audio signal into a vector, a text decoder for transforming a vector into text, an audio decoder for transforming a vector into an audio signal, or the like.

Various information required within a range for achieving the purpose of the disclosure may be stored in the memory 110, and information stored in the memory 110 may be received from an external device and updated based on the user input. For example, audio data and text data may be stored in the memory 110, and vector information corresponding to the audio data and vector information corresponding to the text data may be stored.

The processor 120 controls the overall operation of the electronic device 100. Specifically, the processor 120 may control the operation of the electronic device 100 by executing at least one instruction stored in the memory 110.

The processor 120 may be implemented as at least one of an application specific integrated circuit (ASIC), an embedded processor, a microprocessor, hardware control logic, a hardware finite state machine (FSM), a digital signal processor (DSP), or the like. The term processor 120 may be used to indicate a central processing unit (CPU), a graphic processing unit (GPU), a main processing unit (MPU), or the like.

The processor 120 may input text and audio signals into a neural network model, read a plurality of weight values included in the plurality of layers constituting the neural network model, and perform a neural network computation based on the input data and the weight values. The output data of the neural network computation may be an audio signal corresponding to the input text or text corresponding to the input audio signal.

FIG. 9 is a detailed block diagram illustrating an electronic device according to an embodiment of the disclosure.

Referring to FIG. 9, the electronic device 100 according to an embodiment may include the memory 110, the communicator 130, an inputter 150, an outputter 160, and the processor 120. Hereinafter, portions overlapping with the above description will be omitted or described with reference to FIG. 3.

The communicator 130 may include a circuit and may communicate with an external device. Specifically, the processor 120 may receive various data or information from an external device connected through the communicator 130, and may transmit various data or information to an external device.

The communicator 130 may include at least one of a Wi-Fi module, a Bluetooth module, a wireless communication module, and a near field communication (NFC) module. The Wi-Fi module and the Bluetooth module may perform communication in a Wi-Fi manner and a Bluetooth manner, respectively. The wireless communication module may communicate according to various communication specifications such as IEEE, Zigbee, 3rd generation (3G), 3rd generation partnership project (3GPP), long term evolution (LTE), 5th generation (5G), or the like. The NFC module may communicate by the NFC method using a 13.56 MHz band among various RF-ID frequency bands such as 135 kHz, 13.56 MHz, 433 MHz, 860-960 MHz, 2.45 GHz, or the like.

The outputter 160 includes a circuit, and the processor 120 may output various information through the outputter 160. The outputter 160 may include at least one of a display and a speaker.

The display may display various screens under the control of the processor 120. As an example, the display may display text under the control of the processor 120. Here, the text may be text output by the neural network model.

The display may be implemented as a liquid crystal display (LCD) panel, an organic light emitting diode (OLED) display, or the like, and the display may be implemented as a flexible display, a transparent display, or the like, according to use cases. The display according to the disclosure is not limited to a specific type.

The speaker may output an audio signal under the control of the processor 120. The audio signal may be an audio signal output by the neural network model.

In various embodiments according to the disclosure, the processor 120 may provide output data to the user via the outputter 160. The processor 120 may visually provide output data to the user via the display, and may provide output data to the user in the form of a voice signal via the speaker.

The inputter 150 includes a circuit, and the processor 120 may receive a user command for controlling the operation of the electronic device 100 through the inputter 150. Specifically, the inputter 150 may include a microphone, a camera, or a signal receiver. The inputter 150 may be implemented as a touch screen in a form included in the display.

In various embodiments according to the disclosure, the processor 120 may receive a user command to initiate the operation of the processor 120 in accordance with the disclosure via the inputter 150. The processor 120 may receive a user command for providing output data corresponding to the input data through the neural network model via the inputter 150.

The neural network model may include a plurality of neural network layers. Each of the layers includes a plurality of weight values, and the processor 120 may perform a neural network processing operation through an operation using a result of a previous layer and the plurality of weight values. Examples of a neural network include a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), deep Q-networks, or the like, and the neural network model of the disclosure is not limited to the above examples.
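
The layer-by-layer operation described above, in which each layer combines the result of the previous layer with its own weight values, can be sketched with fully connected layers. This is a generic illustration and not the architecture of the disclosure; the function names `dense`, `relu`, and `forward`, the ReLU activation, and the toy weight values are assumptions.

```python
def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum of the inputs plus a bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in values]


def forward(inputs, layers):
    """Feed the result of each layer into the next layer."""
    activation = inputs
    for weights, biases in layers:
        activation = relu(dense(activation, weights, biases))
    return activation


# Two toy layers: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.5]),
]
output = forward([2.0, 1.0], layers)
print(output)
```

For the input [2.0, 1.0], the first layer yields [1.0, 1.5] and the second layer yields [3.0].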

The processor 120 may train the neural network model through a learning algorithm. Examples of learning algorithms include supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning, but the learning algorithm of the disclosure is not limited to the above examples.

FIG. 10 is a diagram illustrating a method for controlling an electronic device according to an embodiment of the disclosure.

The electronic device 100 may input a first modality and a second modality to the neural network model in operation S1010. One of the first modality and the second modality may be a text, and the other one may be an audio signal.

The electronic device 100 may, based on a comparison between first output data based on the input first modality and second output data based on the input second modality, train the neural network model to output, in response to the second modality being input, the first modality corresponding to the first output data in operation S1020.

The electronic device 100 may tokenize the text into a plurality of text elements, and may segment the audio signal into a plurality of audio elements. The electronic device 100 may mask at least one of the plurality of text elements or at least one of the plurality of audio elements.

The electronic device 100 may input, to the neural network model, the first text composed of the tokenized text elements and the first audio signal in which at least one of the plurality of segmented audio elements is masked.

In this case, the neural network model may output a second audio signal corresponding to the first text and a second text corresponding to the first audio signal.

The neural network model may learn, based on a comparison of the second audio signal and the second text, to output the first text corresponding to the second audio signal when the first audio signal including the at least one masking element is input.

The neural network model may perform learning, based on identification that the text corresponding to the second audio signal is not output with the output of the first audio signal including the at least one masking element, based on comparison between a plurality of audio elements included in the second audio signal and a plurality of text elements included in the second text.

The neural network model may output a text element corresponding to the masking element through the learning.

Alternatively, the electronic device 100 may input, into the neural network model, a first audio signal composed of a plurality of segmented audio elements and a first text in which at least one of a plurality of tokenized text elements is masked.

In this case, the neural network model may output a second text corresponding to the first audio signal and a second audio signal corresponding to the first text.

The neural network model may, based on the first text including at least one masking element being input, be trained to output the first audio signal corresponding to the second text based on the comparison between the second text and the second audio signal.

The neural network model may perform learning, based on identification that the audio signal corresponding to the second text is not output with the output of the first text including the at least one masking element, based on comparison between a plurality of text elements included in the second text and a plurality of audio elements included in the second audio signal.

The neural network model may output an audio element corresponding to the masking element through the learning.

The methods according to various embodiments of the disclosure described above may be implemented with only a software or hardware upgrade for an existing electronic device.

In addition, the various embodiments of the disclosure described above may be implemented through an embedded server provided in an electronic device, or an external server.

The controlling method of the electronic device according to various embodiments may be implemented as a program and stored in various recording media. A computer program that is processed by various processors to execute the various controlling methods described above may be used while stored in a recording medium.

The non-transitory computer-readable medium does not refer to a medium that stores data for a short period of time, such as a register, a cache, or a memory, but refers to a medium that semi-permanently stores data and is readable by a device. Specifically, programs for performing the above-described various methods may be stored in, and provided through, a non-transitory computer-readable medium such as a CD, a DVD, a hard disk, a Blu-ray disk, a universal serial bus (USB) device, a memory card, a ROM, or the like.

While the disclosure has been shown and described with reference to various embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the disclosure as defined by the appended claims and their equivalents.

1. An electronic device comprising: a memory storing a neural network model; and a processor configured to input, to the neural network model, input data to obtain output data, wherein, based on comparison between first output data based on input first modality and second output data based on input second modality, in response to the second modality being input, to output the first modality corresponding to the first output data based on the neural network model, and wherein the second modality comprises at least one masking element.
2. The electronic device of claim 1, wherein one of the first modality and the second modality is a text, and the other one of the first modality and the second modality is an audio signal.
3. The electronic device of claim 2, wherein the neural network model is configured to: tokenize the text into a plurality of text elements, segment the audio signal into a plurality of audio elements, and mask at least one of the plurality of text elements or at least one of the plurality of audio elements.
4. The electronic device of claim 1, wherein the first modality comprises a first text and the second modality comprises a first audio signal, and wherein the neural network model is configured to: output a second audio signal corresponding to the first text and a second text corresponding to the first audio signal with the first text composed of a plurality of tokenized text elements and the first audio signal in which at least one of segmented plurality of elements is masked as input data, and based on a first audio signal comprising the at least one masking element being input based on the comparison of the second audio signal and the second text, output a first text corresponding to the second audio signal.
5. The electronic device of claim 4, wherein the neural network model performs learning, based on identification that the text corresponding to the second audio signal is not output with the output of the first audio signal including the at least one masking element, based on comparison between a plurality of audio elements included in the second audio signal and a plurality of text elements included in the second text.
6. The electronic device of claim 4, wherein the neural network model is configured to output a text element corresponding to the at least one masking element.
7. The electronic device of claim 1, wherein the first modality comprises a first audio signal and the second modality comprises a first text, and wherein the neural network model is configured to: output a second text corresponding to the first audio signal and a second audio signal corresponding to the first text, with the first audio signal composed of a plurality of segmented audio elements and the first text in which at least one of tokenized plurality of elements is masked as input data, and based on a first text signal comprising the at least one masking element being input based on the comparison of the second text and the second audio signal, output a first audio signal corresponding to the second text.
8. The electronic device of claim 7, wherein the neural network model performs learning, based on identification that the audio signal corresponding to the second text is not output with the output of the first text including the at least one masking element, based on comparison between a plurality of text elements included in the second text and a plurality of audio elements included in the second audio signal.
9. The electronic device of claim 7, wherein the neural network model is configured to output an audio element corresponding to the masking element through training.
10. A method of controlling an electronic device, the method comprising: inputting input data to a neural network model; and obtaining output data for the input data through computation of the neural network model, based on comparison between first output data based on input first modality and second output data based on input second modality, in response to the second modality being input, outputting, by the neural network model, the first modality corresponding to the first output data, wherein the second modality comprises at least one masking element.
11. The method of claim 10, wherein one of the first modality and the second modality is a text, and the other one of the first modality and the second modality is an audio signal.
12. The method of claim 11, wherein the text is tokenized into a plurality of text elements, the audio signal is segmented into a plurality of audio elements, and at least one of the plurality of text elements or at least one of the plurality of audio elements are masked and input to the neural network model.
13. The method of claim 10, wherein the first modality comprises a first text and the second modality comprises a first audio signal, and wherein the method further comprises: outputting a second audio signal corresponding to the first text and a second text corresponding to the first audio signal with the first text composed of a plurality of tokenized text elements and the first audio signal in which at least one of segmented plurality of elements is masked as input data, and based on a first audio signal comprising the at least one masking element being input based on the comparison of the second audio signal and the second text, outputting a first text corresponding to the second audio signal.
14. The method of claim 13, further comprising training the neural network model based on identification that the text corresponding to the second audio signal is not output with the output of the first audio signal including the at least one masking element, based on comparison between a plurality of audio elements included in the second audio signal and a plurality of text elements included in the second text.
15. The method of claim 13, further comprising the neural network model outputting a text element corresponding to the masking element.